{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "075f77c1-3328-4265-b077-160d496eb048",
   "metadata": {},
   "source": [
    "# How to build a reliable, curated, and accurate RAG system using Cleanlab and Pinecone"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa91666c",
   "metadata": {},
   "source": [
    "Retrieval-Augmented Generation (RAG) is a technique in natural language processing that combines large language models with external knowledge retrieval. RAG systems let AI models access and use relevant information from a curated knowledge base, leading to more accurate, up-to-date, and context-aware responses. Building a reliable, curated, and accurate RAG system matters for several reasons:\n",
    "\n",
    "- **Improved Accuracy:** By retrieving relevant information, RAG systems can provide more precise and factual responses.\n",
    "- **Reduced Hallucinations:** Access to external knowledge helps minimize the risk of AI models generating false or inconsistent information.\n",
    "- **Adaptability:** RAG systems can be easily updated with new information without retraining the entire model.\n",
    "- **Trustworthiness:** The ability to trace responses back to source documents enhances explainability and trust in AI-generated content.\n",
    "\n",
    "In this notebook, we'll explore how to build such a system using [Cleanlab](https://cleanlab.ai/) and Pinecone.\n",
    "Cleanlab is a company that specializes in data-centric AI, focusing on improving the quality and reliability of AI/ML systems. Cleanlab's Trustworthy Language Model (TLM) is designed to add trust and reliability to AI model outputs and indicate when it's unsure of an answer, making it ideal for applications where unchecked hallucinations could be problematic. This matters most when deploying generative AI applications in production.\n",
    "\n",
    "Pinecone, as a vector database solution, excels in storing and retrieving high-dimensional vectors, making it perfect for managing and querying large-scale document embeddings.\n",
    "\n",
    "By combining Cleanlab's TLM with Pinecone's vector database capabilities, we can create a RAG system that not only retrieves relevant information efficiently but also ensures the quality and reliability of the data being used. This notebook will guide you through the process of:\n",
    "\n",
    "1. Using Cleanlab's TLM to tag and clean document data, removing low-quality chunks and personally identifiable information (PII).\n",
    "2. Leveraging Pinecone to create and manage a vector database for storing and retrieving document embeddings.\n",
    "3. Implementing a RAG pipeline that uses both tools to provide accurate and trustworthy responses.\n",
    "4. Utilizing Cleanlab's TLM to classify metadata, enhance retrieval, and evaluate the trustworthiness of RAG outputs.\n",
    "\n",
    "By the end of this notebook, you'll have a robust understanding of how to build a RAG system that prioritizes reliability, curation, and accuracy, setting a strong foundation for developing trustworthy AI applications.\n",
    "\n",
    "\n",
    "![Reliable RAG with Pinecone and Cleanlab](assets/cleanlab_pinecone_RAG.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2a8f008",
   "metadata": {},
   "source": [
    "Let's install dependencies to run the notebook and then import our libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c447c74a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pinecone==5.0.1 sentence-transformers==3.0.1 cleanlab-studio==2.2.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c70303b2-ce6f-4a03-a460-38bd27e8ae9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import time\n",
    "import warnings\n",
    "import matplotlib.pyplot as plt\n",
    "import os\n",
    "import pinecone\n",
    "import uuid\n",
    "from pinecone import ServerlessSpec\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from typing import List, Tuple, Dict, Optional\n",
    "from difflib import SequenceMatcher\n",
    "import re\n",
    "\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "c7ff0b6b-3a2b-4e6a-a41a-b0f7a54e32d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "if \"PINECONE_API_KEY\" not in os.environ:\n",
    "    os.environ[\"PINECONE_API_KEY\"] = input(\"Please enter your Pinecone API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7183c0f-c96f-4335-bc48-d474628dbe30",
   "metadata": {
    "id": "iMSW1p7td0yB"
   },
   "source": [
    "Below is an example of how to set a Python index serverless specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/guides/indexes/understanding-indexes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "28911e33-870c-4150-8bb8-e12f3352b9e2",
   "metadata": {
    "id": "KpRYFxYXd0yB"
   },
   "outputs": [],
   "source": [
    "cloud = os.environ.get(\"PINECONE_CLOUD\") or \"aws\"\n",
    "region = os.environ.get(\"PINECONE_REGION\") or \"us-east-1\"\n",
    "\n",
    "spec = ServerlessSpec(cloud=cloud, region=region)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf9ee9f-255f-4709-b72c-f81ea118c81a",
   "metadata": {},
   "source": [
    "## Fetch and Load Documents Data\n",
    "\n",
    "First let's fetch the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "255c00f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -nc https://cleanlab-public.s3.amazonaws.com/Datasets/documents-RAG-demo.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e34dd561",
   "metadata": {},
   "source": [
    "Let's read in our documents data that we will use in this workflow. The data columns include the `filename` of the document and the document `text`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e6df812f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(103, 3)\n"
     ]
    }
   ],
   "source": [
    "# Read in dataset\n",
    "df = pd.read_csv(\"documents-RAG-demo.csv\")\n",
    "print(df.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "61f490ad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>documents/Blackstone-Third-Quarter-2023-Invest...</td>\n",
       "      <td>Blackstone Third Quarter 2023 Investor Call Oc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>documents/8k-nike.pdf</td>\n",
       "      <td>SECURITIES AND EXCHANGE COMMISSIONFORM 8-K Cur...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>documents/FY24-Q1-NIKE-Press-Release.pdf</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>documents/10-K 2022-Apple2.pdf</td>\n",
       "      <td>The future principal payments for the Company’...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>documents/q3-fy22-earnings-presentation.pdf</td>\n",
       "      <td>Financial\\tpresentation\\tto\\t accompany\\tmanag...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   index                                           filename  \\\n",
       "0      0  documents/Blackstone-Third-Quarter-2023-Invest...   \n",
       "1      1                              documents/8k-nike.pdf   \n",
       "2      2           documents/FY24-Q1-NIKE-Press-Release.pdf   \n",
       "3      3                     documents/10-K 2022-Apple2.pdf   \n",
       "4      4        documents/q3-fy22-earnings-presentation.pdf   \n",
       "\n",
       "                                                text  \n",
       "0  Blackstone Third Quarter 2023 Investor Call Oc...  \n",
       "1  SECURITIES AND EXCHANGE COMMISSIONFORM 8-K Cur...  \n",
       "2                                                NaN  \n",
       "3  The future principal payments for the Company’...  \n",
       "4  Financial\\tpresentation\\tto\\t accompany\\tmanag...  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f8219ec",
   "metadata": {},
   "source": [
    "## Use Cleanlab's TLM to tag your documents with a topic"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9f745e2",
   "metadata": {},
   "source": [
    "For our RAG system, it would be ideal to tag each of our documents with a particular topic that is most relevant (based on the text content), which can be used as [metadata](https://docs.pinecone.io/guides/data/filter-with-metadata) we can filter with during retrieval in our RAG system. \n",
    "\n",
    "We can now use Cleanlab's [Trustworthy Language Model](https://cleanlab.ai/blog/trustworthy-language-model/) to tag our document chunks with the correct document topic that we can later use to enhance our retrieval process of fetching the correct context from our vector DB."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4cc3362",
   "metadata": {},
   "source": [
    "Cleanlab's Trustworthy Language Model (TLM) is a more reliable LLM that gives high-quality outputs and indicates when it is unsure of the answer to a question, making it suitable for applications where unchecked hallucinations are a show-stopper."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7e0a4b0",
   "metadata": {},
   "source": [
    "### Installing Cleanlab TLM\n",
    "\n",
    "Using TLM requires a [Cleanlab](https://app.cleanlab.ai/) account. Sign up for one [here](https://cleanlab.ai/signup/) if you haven't yet. If you've already signed up, check your email for a personal login link.\n",
    "\n",
    "Let's initialize the TLM client. Here we use powerful TLM settings, but check out the [TLM quickstart tutorial](https://help.cleanlab.ai/tlm/tutorials/tlm/) for configuration options to get results tailored to your use case.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5308022",
   "metadata": {},
   "source": [
    "## Using the TLM\n",
    "\n",
    "You can use the TLM pretty much like any other LLM API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "0f764f6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from cleanlab_studio import Studio\n",
    "\n",
    "key = input(\"Please enter your Cleanlab API key: \")\n",
    "\n",
    "# Get API key from here: https://app.cleanlab.ai/account after creating an account.\n",
    "studio = Studio(key)\n",
    "\n",
    "# Instantiate TLM\n",
    "tlm = studio.TLM()\n",
    "\n",
    "# Prompt the user to enter their own prompt\n",
    "user_prompt = input(\"Enter your prompt: \")\n",
    "\n",
    "output = tlm.prompt(user_prompt)\n",
    "\n",
    "print(f\"Output: {output}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47c191e8",
   "metadata": {},
   "source": [
    "Let's now use Cleanlab's TLM to do classification and classify the text (tag) into different topics. We will make use of code from [Cleanlab's TLM Zero-Shot Classification Tutorial](https://help.cleanlab.ai/tutorials/zero_shot_classification/) to do this. This includes the two helper functions,`parse_category()` and `classify()`, that can be found below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "2ab085d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper Functions\n",
    "\n",
    "def parse_category(\n",
    "    response: str,\n",
    "    categories: List[str],\n",
    "    disable_warnings: bool = False\n",
    ") -> str:\n",
    "    \"\"\"\n",
    "    Extracts one of the provided categories from the response using regex patterns.\n",
    "    \n",
    "    If no category out of the possible categories is directly mentioned in the response, \n",
    "    the category with greatest string similarity to the response is returned (along with a warning).\n",
    "    \n",
    "    Args:\n",
    "        response (str): Response from the LLM\n",
    "        categories (List[str]): List of expected categories\n",
    "        disable_warnings (bool): If True, print warnings are disabled\n",
    "    \n",
    "    Returns:\n",
    "        str: The extracted or best-matching category\n",
    "    \"\"\"\n",
    "    response_str = str(response)\n",
    "    escaped_categories = [re.escape(output) for output in categories]\n",
    "    # Word boundaries keep short categories such as 'it' from matching inside longer words\n",
    "    categories_pattern = r\"\\b(\" + \"|\".join(escaped_categories) + r\")\\b\"\n",
    "\n",
    "    exact_matches = re.findall(categories_pattern, response_str, re.IGNORECASE)\n",
    "    if len(exact_matches) > 0:\n",
    "        # Return the last match using the canonical spelling from `categories`\n",
    "        last_match = str(exact_matches[-1]).lower()\n",
    "        return next(c for c in categories if c.lower() == last_match)\n",
    "\n",
    "    best_match = max(\n",
    "        categories, key=lambda x: SequenceMatcher(None, response_str, x).ratio()\n",
    "    )\n",
    "    similarity_score = SequenceMatcher(None, response_str, best_match).ratio()\n",
    "\n",
    "    if similarity_score < 0.5:\n",
    "        warning_message = f\"None of the categories remotely match raw LLM output: {response_str}.\\nReturning the last entry in the categories list.\"\n",
    "        best_match = categories[-1]\n",
    "    else:\n",
    "        warning_message = f\"None of the categories match raw LLM output: {response_str}\"\n",
    "\n",
    "    if not disable_warnings:\n",
    "        warnings.warn(warning_message)\n",
    "\n",
    "    return best_match\n",
    "\n",
    "def classify(texts: List[str], categories: List[str], prompt_template: str) -> Tuple[List[str], List[float]]:\n",
    "    \"\"\"\n",
    "    Classifies a list of texts into predefined categories using a language model.\n",
    "    \n",
    "    Args:\n",
    "        texts (List[str]): List of texts to classify\n",
    "        categories (List[str]): List of possible categories\n",
    "        prompt_template (str): Template string for formatting the prompt\n",
    "    \n",
    "    Returns:\n",
    "        Tuple[List[str], List[float]]: A tuple containing two lists:\n",
    "            - List of predicted categories for each text\n",
    "            - List of trustworthiness scores for each prediction\n",
    "    \"\"\"\n",
    "    prompts = [prompt_template.format(text=text) for text in texts]\n",
    "    outputs = tlm.prompt(prompts)\n",
    "    \n",
    "    responses = [output['response'] for output in outputs]\n",
    "    trustworthiness_scores = [output['trustworthiness_score'] for output in outputs]\n",
    "    \n",
    "    predictions = [parse_category(response, categories) for response in responses]\n",
    "    \n",
    "    return predictions, trustworthiness_scores\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6346850",
   "metadata": {},
   "source": [
    "Now we can define the prompt and categories we will use to tag our documents with the correct topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "ada8d9a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use Cleanlab's TLM to tag your documents with a topic\n",
    "tagging_prompt = \"\"\"\n",
    "You are an assistant for tagging text as one of several topics. The available topics are:\n",
    "\n",
    "1. 'finance': Related to financial matters, budgeting, accounting, investments, lending, or monetary policies.\n",
    "2. 'hr': Pertaining to Human Resources, including hiring, employee documents (such as a W4 form), employee management, benefits, or workplace policies.\n",
    "3. 'it': Covering Information Technology topics such as software development, network infrastructure, cybersecurity, or tech support.\n",
    "4. 'product': Dealing with a specific company product, product development, management, features, or lifecycle.\n",
    "5. 'sales': Involving selling a product, customer acquisition, revenue generation, or sales performance.\n",
    "\n",
    "If you are not sure which topic to tag the text with, then answer 'unknown'. Only use the lowercase version of the topic name.\n",
    "\n",
    "Task: Analyze the following text and determine the topic it belongs to. Return the topic as a string.\n",
    "\n",
    "Now here is the text to tag:\n",
    "\n",
    "Text: {text}\n",
    "\n",
    "Topic: \n",
    "\"\"\"\n",
    "\n",
    "categories = ['finance', 'hr', 'it', 'product', 'sales', 'unknown']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa0eefa1",
   "metadata": {},
   "source": [
    "Next we will use our helper functions to tag our documents with the correct topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "4e974234",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM... 100%|██████████|\n"
     ]
    }
   ],
   "source": [
    "predictions, trustworthiness_scores = classify(df[\"text\"].tolist(), categories, tagging_prompt)\n",
    "\n",
    "topics_df = df.copy()\n",
    "topics_df['topic'] = predictions\n",
    "topics_df['topic_trustworthiness'] = trustworthiness_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "5b920a8a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>topic</th>\n",
       "      <th>topic_trustworthiness</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>documents/Blackstone-Third-Quarter-2023-Invest...</td>\n",
       "      <td>Blackstone Third Quarter 2023 Investor Call Oc...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.995676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>documents/8k-nike.pdf</td>\n",
       "      <td>SECURITIES AND EXCHANGE COMMISSIONFORM 8-K Cur...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.995666</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>documents/FY24-Q1-NIKE-Press-Release.pdf</td>\n",
       "      <td>NaN</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.962286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>documents/10-K 2022-Apple2.pdf</td>\n",
       "      <td>The future principal payments for the Company’...</td>\n",
       "      <td>hr</td>\n",
       "      <td>0.263197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>documents/q3-fy22-earnings-presentation.pdf</td>\n",
       "      <td>Financial\\tpresentation\\tto\\t accompany\\tmanag...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.962706</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   index                                           filename  \\\n",
       "0      0  documents/Blackstone-Third-Quarter-2023-Invest...   \n",
       "1      1                              documents/8k-nike.pdf   \n",
       "2      2           documents/FY24-Q1-NIKE-Press-Release.pdf   \n",
       "3      3                     documents/10-K 2022-Apple2.pdf   \n",
       "4      4        documents/q3-fy22-earnings-presentation.pdf   \n",
       "\n",
       "                                                text    topic  \\\n",
       "0  Blackstone Third Quarter 2023 Investor Call Oc...  finance   \n",
       "1  SECURITIES AND EXCHANGE COMMISSIONFORM 8-K Cur...  finance   \n",
       "2                                                NaN  unknown   \n",
       "3  The future principal payments for the Company’...       hr   \n",
       "4  Financial\\tpresentation\\tto\\t accompany\\tmanag...  finance   \n",
       "\n",
       "   topic_trustworthiness  \n",
       "0               0.995676  \n",
       "1               0.995666  \n",
       "2               0.962286  \n",
       "3               0.263197  \n",
       "4               0.962706  "
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display results\n",
    "topics_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "548099d7",
   "metadata": {},
   "source": [
    "After running `classify()` and attaching its outputs to the DataFrame, the dataset has two new columns:\n",
    "\n",
    "- `topic`: the category parsed from the TLM response to our tagging prompt\n",
    "- `topic_trustworthiness`: the corresponding trustworthiness score, which quantifies how confident you can be that the response is correct\n",
    "\n",
    "For example, row 3 above (an excerpt from an Apple 10-K) is tagged `hr` with a score of only about 0.26, a sign that this tag should not be trusted."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40623b63",
   "metadata": {},
   "source": [
    "Now we can use the [Trustworthiness Score](https://help.cleanlab.ai/tutorials/tlm/#how-does-the-tlm-trustworthiness-score-work) (stored as `topic_trustworthiness`) to see which of our topic tags are the most and least trustworthy.\n",
    "\n",
    "For our use case, we can replace the `topic` of any response whose `topic_trustworthiness` falls below a chosen threshold with `unknown`, since we cannot trust those tags.\n",
    "\n",
    "To choose the threshold, sort the results by trustworthiness score and look for the cutoff below which the topic values start to look wrong.\n",
    "\n",
    "In practice, if you have the time and resources, your team can manually review low-trustworthiness responses and supply a better human answer. If not, you can pick a threshold below which responses seem untrustworthy and automatically append a warning to, or otherwise flag, any response that falls below it.\n",
    "\n",
    "The overall magnitude and range of trustworthiness scores can differ between datasets, so choose thresholds per application. Compare the relative trustworthiness of different data points first, before reading much into the absolute score of any single data point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "e129a070",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>topic</th>\n",
       "      <th>topic_trustworthiness</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>documents/10-K 2022-Apple.pdf</td>\n",
       "      <td>Apple Inc.CONSOLIDATED STATEMENTS OF SHAREHOLD...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999823</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>documents/Walmart_2022_Investor_Information.pdf</td>\n",
       "      <td>Corporate and Stock InformationListing New Yor...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>documents/ir-policy_sept-2021.pdf</td>\n",
       "      <td>INVESTOR RELATIONS POLICYINVESTOR RELATIONS1IN...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999789</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>documents/IBM_2Q_2017_10-Q.pdf</td>\n",
       "      <td>Table of ContentsUNITED STATES SECURITIES AND ...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999772</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>36</td>\n",
       "      <td>documents/managers-new-employee-onboarding-gui...</td>\n",
       "      <td>Manager’s New Employee Onboarding GuideWelcome...</td>\n",
       "      <td>hr</td>\n",
       "      <td>0.999741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>documents/DK_DB_InvestorPres_FINAL.ppt</td>\n",
       "      <td>Dara Khosrowshahi\u000bEVP &amp; Chief Financial Office...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999723</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60</th>\n",
       "      <td>60</td>\n",
       "      <td>documents/Technology-Use-Permission.pdf</td>\n",
       "      <td>Please return completed form at Back to School...</td>\n",
       "      <td>it</td>\n",
       "      <td>0.999592</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>45</td>\n",
       "      <td>documents/2012_14.doc</td>\n",
       "      <td>Calendar Year 2012 September 12, 2012 Volume 2...</td>\n",
       "      <td>hr</td>\n",
       "      <td>0.999582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>54</td>\n",
       "      <td>documents/Computer-Equipment-Request.pdf</td>\n",
       "      <td>COMPUTER EQUIPMENT REQUEST FORM Please use thi...</td>\n",
       "      <td>it</td>\n",
       "      <td>0.999550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>documents/Form-10-Q-Servicenow.pdf</td>\n",
       "      <td>Table of ContentsUNITED STATESSECURITIES AND E...</td>\n",
       "      <td>finance</td>\n",
       "      <td>0.999476</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    index                                           filename  \\\n",
       "7       7                      documents/10-K 2022-Apple.pdf   \n",
       "13     13    documents/Walmart_2022_Investor_Information.pdf   \n",
       "12     12                  documents/ir-policy_sept-2021.pdf   \n",
       "9       9                     documents/IBM_2Q_2017_10-Q.pdf   \n",
       "36     36  documents/managers-new-employee-onboarding-gui...   \n",
       "6       6             documents/DK_DB_InvestorPres_FINAL.ppt   \n",
       "60     60            documents/Technology-Use-Permission.pdf   \n",
       "45     45                              documents/2012_14.doc   \n",
       "54     54           documents/Computer-Equipment-Request.pdf   \n",
       "18     18                 documents/Form-10-Q-Servicenow.pdf   \n",
       "\n",
       "                                                 text    topic  \\\n",
       "7   Apple Inc.CONSOLIDATED STATEMENTS OF SHAREHOLD...  finance   \n",
       "13  Corporate and Stock InformationListing New Yor...  finance   \n",
       "12  INVESTOR RELATIONS POLICYINVESTOR RELATIONS1IN...  finance   \n",
       "9   Table of ContentsUNITED STATES SECURITIES AND ...  finance   \n",
       "36  Manager’s New Employee Onboarding GuideWelcome...       hr   \n",
       "6   Dara Khosrowshahi\n",
       "EVP & Chief Financial Office...  finance   \n",
       "60  Please return completed form at Back to School...       it   \n",
       "45  Calendar Year 2012 September 12, 2012 Volume 2...       hr   \n",
       "54  COMPUTER EQUIPMENT REQUEST FORM Please use thi...       it   \n",
       "18  Table of ContentsUNITED STATESSECURITIES AND E...  finance   \n",
       "\n",
       "    topic_trustworthiness  \n",
       "7                0.999823  \n",
       "13               0.999793  \n",
       "12               0.999789  \n",
       "9                0.999772  \n",
       "36               0.999741  \n",
       "6                0.999723  \n",
       "60               0.999592  \n",
       "45               0.999582  \n",
       "54               0.999550  \n",
       "18               0.999476  "
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted_topic_df = topics_df.sort_values(\n",
    "    by=\"topic_trustworthiness\", ascending=False\n",
    ").copy()\n",
    "sorted_topic_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "d734a696",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>topic</th>\n",
       "      <th>topic_trustworthiness</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>documents/LD_Trucost_Company_Presentation_0504...</td>\n",
       "      <td>Quantitative Environmental Performance Measure...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.705944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>59</td>\n",
       "      <td>documents/internetsafety.ppt</td>\n",
       "      <td>Santa Rosa District SchoolsINTERNET SAFETYNove...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.673393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>61</td>\n",
       "      <td>documents/eligible-services.ppt</td>\n",
       "      <td>Eligible Services \u000b\u000bService Provider TrainingS...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.671809</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>63</th>\n",
       "      <td>63</td>\n",
       "      <td>documents/TecNewEquipmentRequest.pdf</td>\n",
       "      <td>Grade Level/Dept StaffYes Cost: ______________...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.618944</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>35</td>\n",
       "      <td>documents/Research-Assistant.pdf</td>\n",
       "      <td>Job Description and Responsibilities forResear...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.598968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>20</td>\n",
       "      <td>documents/UnreleasedGames.doc</td>\n",
       "      <td>EVERY GAME THAT WAS NEVER RELEASED FOR THE SPE...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.581846</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>42</td>\n",
       "      <td>documents/HR-Documents-Quick-Start-Guide.pdf</td>\n",
       "      <td>GreenEmployee.com HR DocumentsThis quick start...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.566623</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>27</td>\n",
       "      <td>documents/Corrective-Action-Disciplinary-Actio...</td>\n",
       "      <td>Revision February 2008\\nGuideline for BPPM 60....</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.354750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>documents/10-K 2022-Apple2.pdf</td>\n",
       "      <td>The future principal payments for the Company’...</td>\n",
       "      <td>hr</td>\n",
       "      <td>0.263197</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>79</td>\n",
       "      <td>documents/specs-verticalprintinspectors.pdf</td>\n",
       "      <td>Specifications - Vertical Print InspectorsMode...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>0.262263</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    index                                           filename  \\\n",
       "8       8  documents/LD_Trucost_Company_Presentation_0504...   \n",
       "59     59                       documents/internetsafety.ppt   \n",
       "61     61                    documents/eligible-services.ppt   \n",
       "63     63               documents/TecNewEquipmentRequest.pdf   \n",
       "35     35                   documents/Research-Assistant.pdf   \n",
       "20     20                      documents/UnreleasedGames.doc   \n",
       "42     42       documents/HR-Documents-Quick-Start-Guide.pdf   \n",
       "27     27  documents/Corrective-Action-Disciplinary-Actio...   \n",
       "3       3                     documents/10-K 2022-Apple2.pdf   \n",
       "79     79        documents/specs-verticalprintinspectors.pdf   \n",
       "\n",
       "                                                 text    topic  \\\n",
       "8   Quantitative Environmental Performance Measure...  unknown   \n",
       "59  Santa Rosa District SchoolsINTERNET SAFETYNove...  unknown   \n",
       "61  Eligible Services \u000b\u000bService Provider TrainingS...  unknown   \n",
       "63  Grade Level/Dept StaffYes Cost: ______________...  unknown   \n",
       "35  Job Description and Responsibilities forResear...  unknown   \n",
       "20  EVERY GAME THAT WAS NEVER RELEASED FOR THE SPE...  unknown   \n",
       "42  GreenEmployee.com HR DocumentsThis quick start...  unknown   \n",
       "27  Revision February 2008\\nGuideline for BPPM 60....  unknown   \n",
       "3   The future principal payments for the Company’...       hr   \n",
       "79  Specifications - Vertical Print InspectorsMode...  unknown   \n",
       "\n",
       "    topic_trustworthiness  \n",
       "8                0.705944  \n",
       "59               0.673393  \n",
       "61               0.671809  \n",
       "63               0.618944  \n",
       "35               0.598968  \n",
       "20               0.581846  \n",
       "42               0.566623  \n",
       "27               0.354750  \n",
       "3                0.263197  \n",
       "79               0.262263  "
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted_topic_df.tail(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57021901",
   "metadata": {},
   "source": [
    "After sorting the results by `topic_trustworthiness` in descending order, the topic labels look noticeably less reliable once the score drops below 0.8.\n",
    "\n",
    "Let's replace each of these less trustworthy `topic` responses with `unknown` now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "a4db9c91",
   "metadata": {},
   "outputs": [],
   "source": [
    "topics_df.loc[topics_df[\"topic_trustworthiness\"] < 0.8, \"topic\"] = (\n",
    "    \"unknown\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aea06568",
   "metadata": {},
   "source": [
    "Let's now look at the distribution of the topics we've tagged across our documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "41847c29",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Topic Column Distribution:\n",
      "topic\n",
      "hr         27\n",
      "unknown    21\n",
      "product    18\n",
      "finance    17\n",
      "sales      16\n",
      "it          4\n",
      "Name: count, dtype: int64\n",
      "\n",
      "Topic Percentage Distribution:\n",
      "topic\n",
      "hr         26.213592\n",
      "unknown    20.388350\n",
      "product    17.475728\n",
      "finance    16.504854\n",
      "sales      15.533981\n",
      "it          3.883495\n",
      "Name: proportion, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "topic_counts = topics_df[\"topic\"].value_counts(dropna=False)\n",
    "print(f\"\\nTopic Column Distribution:\\n{topic_counts}\")\n",
    "\n",
    "# Percentage distribution of topics\n",
    "topic_percentages = topics_df[\"topic\"].value_counts(normalize=True) * 100\n",
    "print(f\"\\nTopic Percentage Distribution:\\n{topic_percentages}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2925cae",
   "metadata": {},
   "source": [
    "## Initialize RAG Pipeline using our documents data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "a4bf075a",
   "metadata": {},
   "outputs": [],
   "source": [
    "class PineconeRAGPipeline:\n",
    "    def __init__(\n",
    "        self,\n",
    "        model_name: str = \"paraphrase-MiniLM-L6-v2\",\n",
    "        index_name: str = \"document-index\",\n",
    "        cloud: str = \"aws\",\n",
    "        region: str = \"us-east-1\",\n",
    "    ):\n",
    "        \"\"\"\n",
    "        Initialize the PineconeRAGPipeline with a specified model and index name.\n",
    "\n",
    "        Args:\n",
    "            model_name (str): Name of the SentenceTransformer model to use.\n",
    "            index_name (str): Name of the Pinecone index to create or connect to.\n",
    "            cloud (str): Cloud provider for Pinecone.\n",
    "            region (str): Region for the Pinecone service.\n",
    "        \"\"\"\n",
    "        self.model = SentenceTransformer(model_name)\n",
    "        if not os.environ.get(\"PINECONE_API_KEY\"):\n",
    "            os.environ[\"PINECONE_API_KEY\"] = \"YOUR PINECONE API KEY HERE\"\n",
    "        self.pc = pinecone.Pinecone(api_key=os.environ.get(\"PINECONE_API_KEY\"))\n",
    "        self.index_name = index_name\n",
    "\n",
    "        # list_indexes() returns index descriptions, so compare against their names\n",
    "        existing_indexes = self.pc.list_indexes().names()\n",
    "\n",
    "        if self.index_name not in existing_indexes:\n",
    "            try:\n",
    "                print(f\"Creating new index: {self.index_name}\")\n",
    "                self.pc.create_index(\n",
    "                    name=self.index_name,\n",
    "                    dimension=self.model.get_sentence_embedding_dimension(),\n",
    "                    metric=\"cosine\",\n",
    "                    spec=pinecone.ServerlessSpec(cloud=cloud, region=region),\n",
    "                )\n",
    "            except Exception as e:\n",
    "                if \"ALREADY_EXISTS\" in str(e):\n",
    "                    print(\n",
    "                        f\"Index {self.index_name} already exists. Connecting to existing index.\"\n",
    "                    )\n",
    "                else:\n",
    "                    raise\n",
    "        else:\n",
    "            print(\n",
    "                f\"Index {self.index_name} already exists. Connecting to existing index.\"\n",
    "            )\n",
    "\n",
    "        self.index = self.pc.Index(self.index_name)\n",
    "\n",
    "    def chunk_text(self, text: str, max_tokens: int = 256) -> List[str]:\n",
    "        \"\"\"\n",
    "        Split text into chunks based on a maximum token size.\n",
    "\n",
    "        Args:\n",
    "            text (str): The document text to be chunked.\n",
    "            max_tokens (int): The maximum number of tokens per chunk.\n",
    "\n",
    "        Returns:\n",
    "            List[str]: List of text chunks.\n",
    "        \"\"\"\n",
    "        words = text.split()\n",
    "        chunks = []\n",
    "        current_chunk = []\n",
    "        current_chunk_tokens = 0\n",
    "\n",
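    "        for word in words:\n",
    "            # Note: tokenize() typically includes the model's special tokens, so this\n",
    "            # per-word count is an overestimate and chunks end up under max_tokens.\n",
    "            word_tokens = len(self.model.tokenize([word])[\"input_ids\"][0])\n",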
    "            if current_chunk_tokens + word_tokens > max_tokens and current_chunk:\n",
    "                chunks.append(\" \".join(current_chunk))\n",
    "                current_chunk = []\n",
    "                current_chunk_tokens = 0\n",
    "\n",
    "            current_chunk.append(word)\n",
    "            current_chunk_tokens += word_tokens\n",
    "\n",
    "        if current_chunk:\n",
    "            chunks.append(\" \".join(current_chunk))\n",
    "\n",
    "        for i, chunk in enumerate(chunks):\n",
    "            print(\n",
    "                f\"Chunk {i+1} length: {len(chunk)} characters, \"\n",
    "                f\"{len(self.model.tokenize([chunk])['input_ids'][0])} tokens\"\n",
    "            )\n",
    "\n",
    "        return chunks\n",
    "\n",
    "    def index_documents(self, df: pd.DataFrame) -> int:\n",
    "        \"\"\"\n",
    "        Index documents from a DataFrame with specific metadata structure.\n",
    "\n",
    "        Args:\n",
    "            df (pd.DataFrame): DataFrame containing document information and metadata.\n",
    "                               Expected columns: 'text', 'filename', 'topic'\n",
    "\n",
    "        Returns:\n",
    "            int: The number of chunks successfully indexed.\n",
    "        \"\"\"\n",
    "        valid_docs = []\n",
    "        valid_metadata = []\n",
    "        generated_ids = []\n",
    "\n",
    "        print(\"Starting document processing...\")\n",
    "\n",
    "        for idx, row in df.iterrows():\n",
    "            if pd.isna(row[\"text\"]) or pd.isna(row[\"filename\"]) or pd.isna(row[\"topic\"]):\n",
    "                print(f\"Skipping invalid document at index {idx}: {row['filename']}\")\n",
    "                continue\n",
    "\n",
    "            doc = str(row[\"text\"])\n",
    "            print(f\"Processing document {row['filename']} at index {idx}...\")\n",
    "\n",
    "            chunks = self.chunk_text(doc)\n",
    "\n",
    "            for i, chunk in enumerate(chunks):\n",
    "                chunk_id = str(uuid.uuid4())\n",
    "                chunk_metadata = {\n",
    "                    \"filename\": row[\"filename\"],\n",
    "                    \"topic\": row[\"topic\"],\n",
    "                    \"chunk_index\": i,\n",
    "                    \"total_chunks\": len(chunks),\n",
    "                    \"chunk_id\": chunk_id,\n",
    "                }\n",
    "                valid_docs.append(chunk)\n",
    "                valid_metadata.append(chunk_metadata)\n",
    "                generated_ids.append(chunk_id)\n",
    "\n",
    "        print(f\"Total chunks to encode: {len(valid_docs)}\")\n",
    "\n",
    "        if not valid_docs:\n",
    "            print(\"No valid documents to index.\")\n",
    "            return 0\n",
    "\n",
    "        doc_embeddings = self.model.encode(valid_docs)\n",
    "\n",
    "        batch_size = 100\n",
    "        indexed_count = 0  # only counts chunks whose upsert succeeded\n",
    "        for i in range(0, len(valid_docs), batch_size):\n",
    "            batch_docs = valid_docs[i : i + batch_size]\n",
    "            batch_metadata = valid_metadata[i : i + batch_size]\n",
    "            batch_embeddings = doc_embeddings[i : i + batch_size]\n",
    "\n",
    "            # Each vector is (id, embedding, metadata); chunk text is stored in metadata\n",
    "            vectors = [\n",
    "                (\n",
    "                    generated_ids[i + j],\n",
    "                    embedding.tolist(),\n",
    "                    {**metadata, \"text\": chunk[:1000]},\n",
    "                )\n",
    "                for j, (chunk, embedding, metadata) in enumerate(\n",
    "                    zip(batch_docs, batch_embeddings, batch_metadata)\n",
    "                )\n",
    "            ]\n",
    "\n",
    "            try:\n",
    "                self.index.upsert(vectors=vectors)\n",
    "                indexed_count += len(vectors)\n",
    "                print(f\"Successfully indexed batch of {len(vectors)} chunks.\")\n",
    "            except Exception as e:\n",
    "                print(f\"Error during upsert: {e}\")\n",
    "\n",
    "        print(\"Document indexing completed.\")\n",
    "        return indexed_count\n",
    "\n",
    "    def search(\n",
    "        self, query: str, top_k: int = 5, filter_query: Optional[Dict] = None\n",
    "    ) -> List[Tuple[str, Dict]]:\n",
    "        \"\"\"\n",
    "        Search for the top_k most relevant documents based on the input query and optional filter.\n",
    "\n",
    "        Args:\n",
    "            query (str): The search query text.\n",
    "            top_k (int): The number of top relevant documents to return.\n",
    "            filter_query (Optional[Dict]): Optional filter query to apply during search.\n",
    "\n",
    "        Returns:\n",
    "            List[Tuple[str, Dict]]: List of top_k relevant document texts and their metadata.\n",
    "                                    Each tuple contains (document_text, metadata_dict).\n",
    "        \"\"\"\n",
    "        query_embedding = self.model.encode(query)\n",
    "\n",
    "        try:\n",
    "            results = self.index.query(\n",
    "                vector=query_embedding.tolist(),\n",
    "                top_k=top_k,\n",
    "                filter=filter_query,\n",
    "                include_metadata=True,\n",
    "            )\n",
    "\n",
    "            return [\n",
    "                (\n",
    "                    match.metadata[\"text\"],\n",
    "                    {k: v for k, v in match.metadata.items() if k != \"text\"},\n",
    "                )\n",
    "                for match in results.matches\n",
    "            ]\n",
    "        except Exception as e:\n",
    "            print(f\"Error during search: {e}\")\n",
    "            return []\n",
    "\n",
    "    def delete_index(self) -> None:\n",
    "        \"\"\"\n",
    "        Delete the Pinecone index.\n",
    "\n",
    "        Note:\n",
    "            Errors during deletion are caught and printed, not raised.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            self.pc.delete_index(self.index_name)\n",
    "            print(f\"Index '{self.index_name}' has been deleted.\")\n",
    "        except Exception as e:\n",
    "            print(f\"Error deleting index: {e}\")\n",
    "\n",
    "    def extract_all_chunks_from_index(self, max_chunks: int = 10000) -> pd.DataFrame:\n",
    "        \"\"\"\n",
    "        Extract all document chunks and metadata from the Pinecone index into a DataFrame.\n",
    "\n",
    "        Args:\n",
    "            max_chunks (int): Maximum number of chunks to retrieve.\n",
    "\n",
    "        Returns:\n",
    "            pd.DataFrame: DataFrame containing chunk data and metadata.\n",
    "                          Columns include all metadata fields stored in the index.\n",
    "\n",
    "        Note:\n",
    "            On error, the exception is printed and an empty DataFrame is returned.\n",
    "        \"\"\"\n",
    "        stats = self.index.describe_index_stats()\n",
    "        total_vectors = stats.total_vector_count\n",
    "        dimension = stats.dimension\n",
    "\n",
    "        print(f\"Index name: {self.index_name}\")\n",
    "        print(f\"Total vectors according to stats: {total_vectors}\")\n",
    "        print(f\"Vector dimension: {dimension}\")\n",
    "\n",
    "        try:\n",
    "            results = self.index.query(\n",
    "                vector=[0.0] * dimension,\n",
    "                top_k=max_chunks,\n",
    "                include_values=False,\n",
    "                include_metadata=True,\n",
    "            )\n",
    "\n",
    "            chunk_data = [match.metadata for match in results.matches]\n",
    "            chunk_df = pd.DataFrame(chunk_data)\n",
    "\n",
    "            print(f\"Total chunks retrieved: {len(chunk_df)}\")\n",
    "            return chunk_df\n",
    "\n",
    "        except Exception as e:\n",
    "            print(f\"Error retrieving chunks from index: {e}\")\n",
    "            return pd.DataFrame()\n",
    "\n",
    "    def delete_chunks(self, chunk_ids: List[str]) -> None:\n",
    "        \"\"\"\n",
    "        Delete specific chunks from the Pinecone index.\n",
    "\n",
    "        Args:\n",
    "            chunk_ids (List[str]): List of chunk IDs to delete.\n",
    "\n",
    "        Note:\n",
    "            Errors during deletion are caught and printed, not raised.\n",
    "        \"\"\"\n",
    "        try:\n",
    "            self.index.delete(ids=chunk_ids)\n",
    "            print(f\"Successfully deleted {len(chunk_ids)} chunks from the index.\")\n",
    "        except Exception as e:\n",
    "            print(f\"Error deleting chunks: {e}\")\n",
    "            print(f\"Problematic chunk IDs: {chunk_ids}\")\n",
    "\n",
    "        print(f\"Finished deletion process for {len(chunk_ids)} chunks.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04890aea",
   "metadata": {},
   "source": [
    "Let's use the RAG pipeline defined above to create a Pinecone index, then upsert our document chunks into the vector DB in batches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "076081ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create RAG pipeline on unfiltered documents\n",
    "rag_pipeline = PineconeRAGPipeline(index_name=\"cleanlab-pinecone-tutorial-index\")\n",
    "rag_pipeline.index_documents(topics_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab666388",
   "metadata": {},
   "source": [
    "After upserting our document chunks, let's confirm that the upsert worked (after waiting 30 seconds to give the vector DB time to update) by reading the chunks back into a new DataFrame, `chunk_df`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "2783e5b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index name: cleanlab-pinecone-tutorial-index\n",
      "Total vectors according to stats: 2432\n",
      "Vector dimension: 384\n",
      "Total chunks retrieved: 2432\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chunk_id</th>\n",
       "      <th>chunk_index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>topic</th>\n",
       "      <th>total_chunks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3078a09c-799c-474e-85e7-a9892411e76c</td>\n",
       "      <td>4.0</td>\n",
       "      <td>documents/Walmart_2022_Investor_Information.pdf</td>\n",
       "      <td>follows:2023HighLow1st Quarter(1)$146.94$132.0...</td>\n",
       "      <td>finance</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4b21ab0f-01f2-48a3-a307-b2e3d90177e6</td>\n",
       "      <td>5.0</td>\n",
       "      <td>documents/2012_14.doc</td>\n",
       "      <td>Pay transactions on HUE01 using the following ...</td>\n",
       "      <td>hr</td>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>c30ad6e9-6b96-44ba-8fe9-07a69155c5ec</td>\n",
       "      <td>5.0</td>\n",
       "      <td>documents/internet_safety.pptx</td>\n",
       "      <td>751-5980 (800) 487-1626 (8 a.m. to 5 p.m. CST,...</td>\n",
       "      <td>unknown</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>f5ab2c30-0ac6-4ae4-b22f-3c3c50de5a6c</td>\n",
       "      <td>139.0</td>\n",
       "      <td>documents/Investor_Transcript_2023-10-26.pdf</td>\n",
       "      <td>PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY B...</td>\n",
       "      <td>finance</td>\n",
       "      <td>141.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7b089b6f-30a4-474f-ac7c-c91ba7143728</td>\n",
       "      <td>12.0</td>\n",
       "      <td>documents/2012_14.doc</td>\n",
       "      <td>When entering deduction override amounts for a...</td>\n",
       "      <td>hr</td>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               chunk_id  chunk_index  \\\n",
       "0  3078a09c-799c-474e-85e7-a9892411e76c          4.0   \n",
       "1  4b21ab0f-01f2-48a3-a307-b2e3d90177e6          5.0   \n",
       "2  c30ad6e9-6b96-44ba-8fe9-07a69155c5ec          5.0   \n",
       "3  f5ab2c30-0ac6-4ae4-b22f-3c3c50de5a6c        139.0   \n",
       "4  7b089b6f-30a4-474f-ac7c-c91ba7143728         12.0   \n",
       "\n",
       "                                          filename  \\\n",
       "0  documents/Walmart_2022_Investor_Information.pdf   \n",
       "1                            documents/2012_14.doc   \n",
       "2                   documents/internet_safety.pptx   \n",
       "3     documents/Investor_Transcript_2023-10-26.pdf   \n",
       "4                            documents/2012_14.doc   \n",
       "\n",
       "                                                text    topic  total_chunks  \n",
       "0  follows:2023HighLow1st Quarter(1)$146.94$132.0...  finance           9.0  \n",
       "1  Pay transactions on HUE01 using the following ...       hr          17.0  \n",
       "2  751-5980 (800) 487-1626 (8 a.m. to 5 p.m. CST,...  unknown           6.0  \n",
       "3  PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY B...  finance         141.0  \n",
       "4  When entering deduction override amounts for a...       hr          17.0  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "time.sleep(30)\n",
    "\n",
    "# Extract chunks from the Pinecone index into a DataFrame\n",
    "chunk_df = rag_pipeline.extract_all_chunks_from_index()\n",
    "\n",
    "# Display the resulting DataFrame\n",
    "chunk_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "bcfbd8c5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2432, 6)"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chunk_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "1603f18d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The columns in our documents data are: ['chunk_id', 'chunk_index', 'filename', 'text', 'topic', 'total_chunks']\n"
     ]
    }
   ],
   "source": [
    "document_data_columns = list(chunk_df.columns)\n",
    "print(f\"The columns in our documents data are: {document_data_columns}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f07a408d",
   "metadata": {},
   "source": [
    "We can see below that the number of chunks per document varies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "94377992",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "filename\n",
      "documents/10-K 2022-Apple.pdf                                               3.0\n",
      "documents/10-K 2022-Apple2.pdf                                             16.0\n",
      "documents/2012_14.doc                                                      17.0\n",
      "documents/2020_-_I-9_Employment_Verification_Training_Presentation.ppt     17.0\n",
      "documents/2023-CEASampleEmployeeHandbook.doc                              286.0\n",
      "                                                                          ...  \n",
      "documents/the-selling-process-1.pdf                                         7.0\n",
      "documents/the-selling-process-2.pdf                                        10.0\n",
      "documents/unit04.pdf                                                       22.0\n",
      "documents/what-is-selling.pptx                                              7.0\n",
      "documents/work-ez-specs.pdf                                                 2.0\n",
      "Name: chunk_index, Length: 95, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "chunks_per_doc = chunk_df.groupby(\"filename\")[\"chunk_index\"].max() + 1\n",
    "print(chunks_per_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17b8a11",
   "metadata": {},
   "source": [
    "## Use Cleanlab's TLM to detect and filter out bad document chunks"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa6f410b",
   "metadata": {},
   "source": [
    "Now that we have extracted our document chunks from our Pinecone index, let's use Cleanlab's TLM (via the helper functions we defined earlier to tag documents) to detect low-quality document chunks (e.g. HTML, uninterpretable or cut-off phrases, or non-English text) and chunks containing personally identifiable information (PII), so that we can filter them out of our vector DB."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9698b0f0",
   "metadata": {},
   "source": [
    "Below we define the prompts we will use to find the document chunks that are badly chunked or contain PII and run our `classify()` helper function to obtain the response and trustworthiness scores from TLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "9b2b533c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bad chunks detection (you can replace this with your own definitions)\n",
    "bad_chunks_prompt = \"\"\"\n",
    "I am chunking documents into smaller pieces to create a knowledge base for question-answering systems. \n",
    "\n",
    "Task: Help me check if the following Text is badly chunked. A badly chunked text is any text that is: full of HTML/XML and other non-language strings or non-english words, has hardly any informative content or missing key information, or text that contains Personally-Identifiable Information (PII) and other sensitive confidential information.\n",
    "\n",
    "Return 'bad_chunk' if the provided Text is badly chunked, and 'good_chunk' otherwise. Please be as accurate as possible, the world depends on it.\n",
    "\n",
    "Text: {text}\n",
    "\"\"\"\n",
    "\n",
    "bad_chunk_categories = [\"bad_chunk\", \"good_chunk\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "8b903d58",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM... 100%|██████████|\n"
     ]
    }
   ],
   "source": [
    "bad_chunk_predictions, bad_chunk_scores = classify(chunk_df[\"text\"].tolist(), bad_chunk_categories, bad_chunks_prompt)\n",
    "\n",
    "chunk_df['chunk_quality'] = bad_chunk_predictions\n",
    "chunk_df['chunk_quality_trustworthiness'] = bad_chunk_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "d538526d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# PII detection\n",
    "pii_prompt = \"\"\"\n",
    "I am chunking documents into smaller pieces to create a knowledge base for question-answering systems. \n",
    "\n",
    "Task: Analyze the following text and determine if it has personally identifiable information (PII). PII is information that could be used to identify an individual or is otherwise sensitive. Names, addresses, phone numbers are examples of common PII.\n",
    "\n",
    "Return 'is_PII' if the text contains PII and 'no_PII' if it does not. Please be as accurate as possible, the world depends on it.\n",
    "\n",
    "Text: {text}\n",
    "\"\"\"\n",
    "\n",
    "pii_categories = [\"is_PII\", \"no_PII\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "8c3ca17e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM...   0%|          |"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM... 100%|██████████|\n"
     ]
    }
   ],
   "source": [
    "pii_predictions, pii_scores = classify(chunk_df[\"text\"].tolist(), pii_categories, pii_prompt)\n",
    "\n",
    "chunk_df['pii_check'] = pii_predictions\n",
    "chunk_df['pii_check_trustworthiness'] = pii_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "dd151a72",
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_chunk_quality_df = chunk_df.sort_values(\n",
    "    by=\"chunk_quality_trustworthiness\", ascending=False\n",
    ").copy()\n",
    "sorted_is_pii_df = chunk_df.sort_values(\n",
    "    by=\"pii_check_trustworthiness\", ascending=False\n",
    ").copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a41cf91",
   "metadata": {},
   "source": [
    "Now let's count the chunks that TLM labeled as badly chunked or as containing PII with a trustworthiness score of at least 0.95. These are the classifications TLM is most confident in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "abbd98d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of chunks confidently flagged as badly chunked: 87\n",
      "Number of chunks confidently flagged as containing PII: 127\n"
     ]
    }
   ],
   "source": [
    "worst_chunks = sorted_chunk_quality_df.query(\n",
    "    \"chunk_quality == 'bad_chunk' and chunk_quality_trustworthiness >= 0.95\"\n",
    ")\n",
    "worst_pii = sorted_is_pii_df.query(\n",
    "    \"pii_check == 'is_PII' and pii_check_trustworthiness >= 0.95\"\n",
    ")\n",
    "\n",
    "print(\n",
    "    f\"Number of chunks confidently flagged as badly chunked: {worst_chunks.shape[0]}\"\n",
    ")\n",
    "print(\n",
    "    f\"Number of chunks confidently flagged as containing PII: {worst_pii.shape[0]}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d288882",
   "metadata": {},
   "source": [
    "Now that TLM has flagged the low-quality document chunks, let's:\n",
    "\n",
    "1. Update our Pinecone DB by removing the chunks with any of the issues detected by Cleanlab\n",
    "2. Verify that the update to our Pinecone DB worked"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "116b3fc7",
   "metadata": {},
   "source": [
    "Let's observe some of the worst chunks that we are going to delete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "0567869f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "140,001 - 150,000 150,001 and over0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15$0 - $8,000 8,001 - 17,000 17,001 - 26,000 26,001 - 34,000 34,001 - 44,000 44,001 - 75,000 75,001 - 85,000 85,001 - 110,000 110,001 - 125,000 125,001 - 140,000 140,001 and over0 1 2 3 4 5 6 7 8 9 10$0 - $75,000 75,001\n",
      "\n",
      "\n",
      "718    3, 5 2019 3, 5 2018 3, 4, 52,5522,3561,9621,77...\n",
      "Name: text, dtype: object\n"
     ]
    }
   ],
   "source": [
    "print(worst_chunks.iloc[0][\"text\"])\n",
    "print(\"\\n\")\n",
    "print(worst_chunks.iloc[1:2][\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90d8c0d6",
   "metadata": {},
   "source": [
    "These chunks are hard to read and do not form complete English sentences.\n",
    "\n",
    "Now let's observe some of the worst examples of PII that we are going to delete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "31784966",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "209    Cathy McGill at (804) 371-7800 or Email at cat...\n",
      "Name: text, dtype: object\n",
      "\n",
      "\n",
      "438    (TTY) (800) 700-2320; or visit the department’...\n",
      "Name: text, dtype: object\n"
     ]
    }
   ],
   "source": [
    "print(worst_pii.iloc[2:3][\"text\"])\n",
    "print(\"\\n\")\n",
    "print(worst_pii.iloc[6:7][\"text\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66d3a698",
   "metadata": {},
   "source": [
    "Both examples contain contact details such as a name or a phone number, which is why TLM flagged them as PII, so these chunks can be removed.\n",
    "\n",
    "We can now build the list of chunks to delete using the `chunk_id` we assigned to each chunk earlier, which maps each row back to its vector in the Pinecone index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "7c350a6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the chunk IDs for the chunks that have issues to update our Pinecone index\n",
    "worst_chunks_to_delete_ids = worst_chunks[\"chunk_id\"].tolist()\n",
    "worst_pii_to_delete_ids = worst_pii[\"chunk_id\"].tolist()\n",
    "chunks_to_delete_ids = list(set(worst_chunks_to_delete_ids + worst_pii_to_delete_ids))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "c1d372b1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "195"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(chunks_to_delete_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "402083dd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Successfully deleted 195 chunks from the index.\n",
      "Finished deletion process for 195 chunks.\n"
     ]
    }
   ],
   "source": [
    "# Delete the identified chunks from the index\n",
    "rag_pipeline.delete_chunks(chunks_to_delete_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d48f24b0",
   "metadata": {},
   "source": [
    "After deleting these chunks, let's confirm that the deletion worked by reading the remaining chunks into a new DataFrame, `updated_chunk_df`. We first wait 30 seconds to give the vector DB time to update.\n",
    "\n",
    "We then check that the number of chunks removed (original count minus remaining count) equals the number of chunk IDs we asked to delete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "00245478",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index name: cleanlab-pinecone-tutorial-index\n",
      "Total vectors according to stats: 2237\n",
      "Vector dimension: 384\n",
      "Total chunks retrieved: 2237\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chunk_id</th>\n",
       "      <th>chunk_index</th>\n",
       "      <th>filename</th>\n",
       "      <th>text</th>\n",
       "      <th>topic</th>\n",
       "      <th>total_chunks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>f5ab2c30-0ac6-4ae4-b22f-3c3c50de5a6c</td>\n",
       "      <td>139.0</td>\n",
       "      <td>documents/Investor_Transcript_2023-10-26.pdf</td>\n",
       "      <td>PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY B...</td>\n",
       "      <td>finance</td>\n",
       "      <td>141.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3078a09c-799c-474e-85e7-a9892411e76c</td>\n",
       "      <td>4.0</td>\n",
       "      <td>documents/Walmart_2022_Investor_Information.pdf</td>\n",
       "      <td>follows:2023HighLow1st Quarter(1)$146.94$132.0...</td>\n",
       "      <td>finance</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>194cb427-9583-4858-bf58-5a5a92ea903c</td>\n",
       "      <td>0.0</td>\n",
       "      <td>documents/NHO 2015 - 2016_6-15.pps</td>\n",
       "      <td>Tennessee State UniversityNew EmployeeBenefits...</td>\n",
       "      <td>hr</td>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4b21ab0f-01f2-48a3-a307-b2e3d90177e6</td>\n",
       "      <td>5.0</td>\n",
       "      <td>documents/2012_14.doc</td>\n",
       "      <td>Pay transactions on HUE01 using the following ...</td>\n",
       "      <td>hr</td>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9caf3145-2e0b-4df9-80ec-57de1c28d10d</td>\n",
       "      <td>9.0</td>\n",
       "      <td>documents/Number_8.0_DISCIPLINARY_ACTION_AND_C...</td>\n",
       "      <td>be reviewed by the Department of Human Resourc...</td>\n",
       "      <td>hr</td>\n",
       "      <td>33.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                               chunk_id  chunk_index  \\\n",
       "0  f5ab2c30-0ac6-4ae4-b22f-3c3c50de5a6c        139.0   \n",
       "1  3078a09c-799c-474e-85e7-a9892411e76c          4.0   \n",
       "2  194cb427-9583-4858-bf58-5a5a92ea903c          0.0   \n",
       "3  4b21ab0f-01f2-48a3-a307-b2e3d90177e6          5.0   \n",
       "4  9caf3145-2e0b-4df9-80ec-57de1c28d10d          9.0   \n",
       "\n",
       "                                            filename  \\\n",
       "0       documents/Investor_Transcript_2023-10-26.pdf   \n",
       "1    documents/Walmart_2022_Investor_Information.pdf   \n",
       "2                 documents/NHO 2015 - 2016_6-15.pps   \n",
       "3                              documents/2012_14.doc   \n",
       "4  documents/Number_8.0_DISCIPLINARY_ACTION_AND_C...   \n",
       "\n",
       "                                                text    topic  total_chunks  \n",
       "0  PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY B...  finance         141.0  \n",
       "1  follows:2023HighLow1st Quarter(1)$146.94$132.0...  finance           9.0  \n",
       "2  Tennessee State UniversityNew EmployeeBenefits...       hr          25.0  \n",
       "3  Pay transactions on HUE01 using the following ...       hr          17.0  \n",
       "4  be reviewed by the Department of Human Resourc...       hr          33.0  "
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Sleep timer to allow the deletion process to complete\n",
    "time.sleep(30)\n",
    "\n",
    "# Verify the update went through and we have the correct number of chunks\n",
    "updated_chunk_df = rag_pipeline.extract_all_chunks_from_index()\n",
    "updated_chunk_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "be9baa4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confirm the number of chunks deleted is equal to the total number of original chunks minus the number of chunks after filtering out bad chunks\n",
    "assert len(chunk_df) - len(updated_chunk_df) == len(chunks_to_delete_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b3908e5",
   "metadata": {},
   "source": [
    "**Note:** You can also run these ingestion steps (tagging, chunking, and cleaning) in real time as new documents arrive, so that only curated chunks reach the Pinecone index."
   ]
  },
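  {
   "cell_type": "markdown",
   "id": "f3a91c7e",
   "metadata": {},
   "source": [
    "As a rough sketch of what such an ingestion-time check could look like, the function below applies the same rule used above (drop a chunk when it receives the bad label with a trustworthiness score of at least 0.95) to a batch of new chunks before they are upserted. The names `filter_new_chunks` and `classify_fn` are hypothetical and not part of this notebook's pipeline. `classify_fn` stands in for any callable that returns a list of labels and a list of scores, and the stub classifier exists only so the example runs without calling TLM.\n",
    "\n",
    "```python\n",
    "def filter_new_chunks(chunks, classify_fn, bad_label, min_trust=0.95):\n",
    "    # Split chunks into (kept, dropped) using a label/score classifier\n",
    "    labels, scores = classify_fn(chunks)\n",
    "    kept, dropped = [], []\n",
    "    for chunk, label, score in zip(chunks, labels, scores):\n",
    "        # Drop only when the classifier is confident the chunk is bad\n",
    "        if label == bad_label and score >= min_trust:\n",
    "            dropped.append(chunk)\n",
    "        else:\n",
    "            kept.append(chunk)\n",
    "    return kept, dropped\n",
    "\n",
    "\n",
    "def stub_pii_classifier(chunks):\n",
    "    # Hypothetical stand-in for a TLM call: flags any chunk containing '@'\n",
    "    labels = ['is_PII' if '@' in c else 'no_PII' for c in chunks]\n",
    "    scores = [0.99 for _ in chunks]\n",
    "    return labels, scores\n",
    "\n",
    "\n",
    "new_chunks = ['Quarterly revenue grew 4%.', 'Contact jane@example.com for details.']\n",
    "kept, dropped = filter_new_chunks(new_chunks, stub_pii_classifier, bad_label='is_PII')\n",
    "print(kept)\n",
    "print(dropped)\n",
    "```\n",
    "\n",
    "Only the `kept` chunks would then be embedded and upserted, while the `dropped` chunks could be logged for later review."
   ]
  },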
  {
   "cell_type": "markdown",
   "id": "7eb9c280-1c36-4bd5-8bf4-e3a09059353d",
   "metadata": {},
   "source": [
    "### How to search for documents with metadata\n",
    "\n",
    "Below is an example of how to search your curated documents with a query of your choice, using metadata to filter for relevant information.\n",
    "\n",
    "Here we filter on `topic = sales` and set `top_k=2` to return the two chunks that best match the query within that topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "f825333b-f2aa-44fb-9bcb-40ef6d3cd72b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document: learn the person’s name, position, age, education, experience, hobbies, etc. Many times, just gettin...\n",
      "Metadata: {'chunk_id': 'e9cb9756-2426-44e1-ada0-f50e7d8c17ae', 'chunk_index': 5.0, 'filename': 'documents/the-selling-process-2.pdf', 'topic': 'sales', 'total_chunks': 10.0}\n",
      "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n",
      "Document: often used in a streamlined sales process.You have completed the topic for the sales process. Thank ...\n",
      "Metadata: {'chunk_id': '62970e10-536e-4c0d-b925-6a632aebfb1b', 'chunk_index': 14.0, 'filename': 'documents/sap-sales-overview.pdf', 'topic': 'sales', 'total_chunks': 15.0}\n",
      "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "results = rag_pipeline.search(\n",
    "    query=\"Tell me about sales\",  # YOUR SEARCH QUERY HERE\n",
    "    top_k=2,\n",
    "    filter_query={\"topic\": {\"$eq\": \"sales\"}},\n",
    ")\n",
    "\n",
    "for doc_text, metadata in results:\n",
    "    print(f\"Document: {doc_text[:100]}...\")\n",
    "    print(f\"Metadata: {metadata}\")\n",
    "    print(\"-\" * 500)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "160b19ad",
   "metadata": {},
   "source": [
    "## Use Cleanlab's TLM for intent classification to classify metadata\n",
    "\n",
    "One option is to ask the user to pick the topic of their question from a predefined list. Alternatively, we can predict the topic from the question itself by using Cleanlab's TLM as a classifier. This way, intent classification happens automatically and comes with a trustworthiness score for each prediction. The predicted topic can then be used for metadata filtering when querying our vector DB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "0febc6be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Intent classification\n",
    "intent_classification_prompt = \"\"\"\n",
    "You are an assistant for classifying a question as one of several topics. The available topics are:\n",
    "\n",
    "1. 'finance': Related to financial matters, budgeting, accounting, investments, lending, or monetary policies.\n",
    "2. 'hr': Pertaining to Human Resources, including hiring, employee documents (such as a W4 form), employee management, benefits, or workplace policies.\n",
    "3. 'it': Covering Information Technology topics such as software development, network infrastructure, cybersecurity, or tech support.\n",
    "4. 'product': Dealing with a specific company product, product development, management, features, or lifecycle.\n",
    "5. 'sales': Involving selling a product, customer acquisition, revenue generation, or sales performance.\n",
    "\n",
    "If you are not sure which topic to classify the question as, then answer 'unknown'.\n",
    "\n",
    "Task: Analyze the following question and determine the topic it belongs to. Return the topic as a string.\n",
    "\n",
    "Now here is the question to verify:\n",
    "\n",
    "Text: {text}\n",
    "\n",
    "Topic: \n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "1325b346",
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"What were Blackstone's fee-related earnings in the third quarter of 2023?\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "933c407b",
   "metadata": {},
   "source": [
    "Using the question and prompt defined above, let's have TLM classify the question into a topic (intent classification). We will use this topic as the metadata filter for our vector DB query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "4fa79f80",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM... 100%|██████████|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'metadata_response': 'finance', 'metadata_trustworthiness_score': 0.9059273817223932}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "intent_predictions, intent_trustworthiness_scores = classify([question], categories, intent_classification_prompt)\n",
    "\n",
    "metadata_topic_response = {\n",
    "    \"metadata_response\": intent_predictions[0],\n",
    "    \"metadata_trustworthiness_score\": intent_trustworthiness_scores[0]\n",
    "}\n",
    "\n",
    "print(metadata_topic_response)\n",
    "\n",
    "# Extract the topic from the response\n",
    "metadata_topic = metadata_topic_response[\"metadata_response\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edd14f51",
   "metadata": {},
   "source": [
    "Below, we pass the classified topic (or any other relevant metadata) as a filter to narrow the search for relevant context in the Pinecone index.\n",
    "\n",
    "Because documents tagged `unknown` could belong to any topic, we always include `unknown` in the filter alongside the predicted topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "6ae30c8e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top document chunk for this query: \n",
      "\n",
      "of year-over-year base management fee growth at Blackstone. Fee-related earnings were $1.1 billion in the third quarter, or $0.92 per share, largely stable with Q2, underpinned by steady top-line performance, along with the firm's strong margin position. The year-over-year comparison was affected by a decline in transaction fees, which are activity-based, as well as lower fee-related performance revenues. Notwithstanding these headwinds, the firm generated $275 million of fee-related performance revenues in\n"
     ]
    }
   ],
   "source": [
    "# Define the list of topics to filter on\n",
    "topic_filter = [metadata_topic, \"unknown\"]\n",
    "\n",
    "# Use the classified topic to filter the search results\n",
    "top_doc_chunk = rag_pipeline.search(\n",
    "    question, top_k=1, filter_query={\"topic\": {\"$in\": topic_filter}}\n",
    ")\n",
    "print(f\"Top document chunk for this query: \\n\\n{top_doc_chunk[0][0]}\")"
   ]
  },
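  {
   "cell_type": "markdown",
   "id": "b72e40d5",
   "metadata": {},
   "source": [
    "The filter construction above can be wrapped in a small helper. The sketch below also shows one possible use of the intent trustworthiness score, which is an assumption rather than something this notebook does: when the score falls below a cutoff, it skips the topic filter so the search covers all topics. The helper name `build_topic_filter` and the `min_trust` value of 0.8 are hypothetical; only the `$in` filter shape comes from the search call above.\n",
    "\n",
    "```python\n",
    "def build_topic_filter(topic, trust_score, min_trust=0.8):\n",
    "    # Return a metadata filter dict, or None to search across all topics\n",
    "    if topic == 'unknown' or trust_score < min_trust:\n",
    "        # Low-confidence or unknown intent: do not restrict by topic\n",
    "        return None\n",
    "    # Always include 'unknown' so untagged documents stay searchable\n",
    "    return {'topic': {'$in': [topic, 'unknown']}}\n",
    "\n",
    "\n",
    "print(build_topic_filter('finance', 0.906))\n",
    "print(build_topic_filter('finance', 0.42))\n",
    "```\n",
    "\n",
    "Whether `rag_pipeline.search` accepts `filter_query=None` depends on how it was defined earlier in this notebook, so check that before relying on the unfiltered branch."
   ]
  },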
  {
   "cell_type": "markdown",
   "id": "2de70264-3808-49df-bad1-87673534453d",
   "metadata": {},
   "source": [
    "## Use Cleanlab's TLM to get Trustworthiness Score for RAG Outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2f32e3c-3933-4544-aa39-49cdc8437e45",
   "metadata": {},
   "source": [
    "Now let's use the context from our top document chunk we obtained in the previous step to actually answer the original question asked: `What were Blackstone's fee-related earnings in the third quarter of 2023?`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "e2e04c82",
   "metadata": {},
   "outputs": [],
   "source": [
    "# search() returns (text, metadata) pairs; take the text of the top match\n",
    "top_doc_chunks = rag_pipeline.search(\n",
    "    question, top_k=1, filter_query={\"topic\": {\"$in\": topic_filter}}\n",
    ")[0][0]\n",
    "\n",
    "prompt = f\"\"\"You are an assistant for answering the following question based on the document context.\n",
    "\n",
    "Question: {question}\n",
    "Document Context: {top_doc_chunks}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57bec132",
   "metadata": {},
   "source": [
    "Now let's query TLM with the prompt we just built, which combines the user question with the document context retrieved from our Pinecone index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "7c3f8830",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'response': \"Blackstone's fee-related earnings in the third quarter of 2023 were $1.1 billion, or $0.92 per share.\",\n",
       " 'trustworthiness_score': 0.95175686577078}"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "output = tlm.prompt(prompt)\n",
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9a8dcac-518d-420c-aee5-51f8566bfe31",
   "metadata": {},
   "source": [
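    "### Is this response correct?\n",
    "\n",
    "We can check the answer against the source chunk, which states the same fee-related earnings figure:"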
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "8433d811",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Fee-related earnings were $1.1 billion in the third quarter'"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_to_check = updated_chunk_df.query(\"filename == 'documents/Blackstone-Third-Quarter-2023-Investor-Call.pdf' and text.str.contains('of year-over-year base management fee', case=False, na=False)\").iloc[0][\"text\"]\n",
    "text_to_check[60:119]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ee253b8-1ecf-4ea2-a423-13bbe016e11e",
   "metadata": {},
   "source": [
    "### Can we find a hallucination?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2367689a",
   "metadata": {},
   "source": [
    "Let's now try to surface a hallucination with a new question. As before, we first classify the question's intent to get the metadata topic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "572b4ab3",
   "metadata": {},
   "outputs": [],
   "source": [
    "question = \"Can you tell me if these Good's Homestyle Potato Chips support 9 per row in the case?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "21002545",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Querying TLM... 100%|██████████|"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'metadata_response': 'product', 'metadata_trustworthiness_score': 0.951910293755863}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "intent_predictions, intent_trustworthiness_scores = classify([question], categories, intent_classification_prompt)\n",
    "\n",
    "metadata_topic_response = {\n",
    "    \"metadata_response\": intent_predictions[0],\n",
    "    \"metadata_trustworthiness_score\": intent_trustworthiness_scores[0]\n",
    "}\n",
    "\n",
    "print(metadata_topic_response)\n",
    "\n",
    "# Extract the topic from the response\n",
    "metadata_topic = metadata_topic_response[\"metadata_response\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54338e19",
   "metadata": {},
   "source": [
    "Now we can pass the question and context into our prompt to catch this hallucination."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "214fe620",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'response': \"Yes, the Good's Homestyle Potato Chips support 9 per row in the case, as indicated by the pallet configuration of 72 per pallet (9 x 8).\",\n",
       " 'trustworthiness_score': 0.7508270454814128}"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Define the list of topics to filter on\n",
    "topic_filter = [metadata_topic, \"unknown\"]\n",
    "\n",
    "# search() returns (text, metadata) pairs; take the text of the top match\n",
    "top_doc_chunks = rag_pipeline.search(\n",
    "    question, top_k=1, filter_query={\"topic\": {\"$in\": topic_filter}}\n",
    ")[0][0]\n",
    "\n",
    "prompt = f\"\"\"You are an assistant for answering the following question based on the document context.\n",
    "\n",
    "Question: {question}\n",
    "Document Context: {top_doc_chunks}\n",
    "\"\"\"\n",
    "\n",
    "# Instantiate TLM with the default model and quality preset\n",
    "tlm = studio.TLM()\n",
    "\n",
    "output = tlm.prompt(prompt)\n",
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b2ea643-4621-4d3c-8bf7-afc438e65fa3",
   "metadata": {},
   "source": [
    "### Is this correct?\n",
    "\n",
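    "No. The spec sheet lists 24 bags per case, arranged as 3 x 6 with 6 on top, so no row in the case holds 9 bags. The 9 x 8 figure cited in the response describes the pallet configuration, not the case, so this answer is a hallucination."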
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "c3123ebe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'24/case (3 x 6 & 6 on top)'"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_to_check = updated_chunk_df.query(\"filename == 'documents/potato-chips-specs.pdf'\").iloc[0][\"text\"]\n",
    "text_to_check[113:139]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e992780",
   "metadata": {},
   "source": [
    "### Flag less trustworthy responses with a warning for human review"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7408e21d",
   "metadata": {},
   "source": [
    "You can boost the reliability of your Generative AI applications by adding contingency plans to override LLM answers whose trustworthiness score falls below some threshold (e.g., route to human for answer, append disclaimer that answer is uncertain, revert to a default baseline answer, or request a prompt with more information/context).\n",
    "\n",
    "If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead.\n",
    "If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold.\n",
    "\n",
    "The overall magnitude and range of trustworthiness scores can differ between datasets, so we recommend choosing thresholds separately for each application. Look first at how the scores rank different data points relative to one another, and only then at the absolute score of any individual data point. For more information on choosing an appropriate threshold, refer to the [TLM quickstart tutorial](https://help.cleanlab.ai/tlm/tutorials/tlm/)."
   ]
  },
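  {
   "cell_type": "markdown",
   "id": "c94d6a1b",
   "metadata": {},
   "source": [
    "One way to act on the advice about relative scores is to rank a sample of responses by trustworthiness and derive the threshold from that ranking rather than from a fixed number. The sketch below does this with a review budget: it returns a cutoff such that roughly the lowest `review_fraction` of responses fall below it. The function name `threshold_for_review_budget` and the sample scores are illustrative assumptions, not TLM outputs.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def threshold_for_review_budget(scores, review_fraction):\n",
    "    # Pick a cutoff so that about review_fraction of scores fall below it\n",
    "    if not scores:\n",
    "        raise ValueError('scores must be non-empty')\n",
    "    ranked = sorted(scores)\n",
    "    # Number of lowest-scoring responses we can afford to review\n",
    "    n_review = min(len(ranked) - 1, math.floor(len(ranked) * review_fraction))\n",
    "    # Scores strictly below the returned value are sent to review\n",
    "    return ranked[n_review]\n",
    "\n",
    "\n",
    "sample_scores = [0.95, 0.91, 0.88, 0.83, 0.75, 0.62, 0.97, 0.90, 0.45, 0.93]\n",
    "threshold = threshold_for_review_budget(sample_scores, review_fraction=0.3)\n",
    "flagged = [s for s in sample_scores if s < threshold]\n",
    "print(threshold, flagged)\n",
    "```\n",
    "\n",
    "A cutoff chosen this way should still be sanity-checked by reading a few responses on either side of it, since tied scores or a small sample can shift how many responses end up flagged."
   ]
  },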
  {
   "cell_type": "markdown",
   "id": "a29306ae",
   "metadata": {},
   "source": [
    "Since our hallucination has a trustworthiness score of `0.75` from our example above, we will set the threshold to `0.8` and append a warning statement to any response whose trustworthiness score falls below this threshold so that a human reviewer is aware of the potential untrustworthiness of the response. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "419a0801",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trustworthiness score: 0.7508270454814128\n",
      "\n",
      "Processed response with warning: Yes, the Good's Homestyle Potato Chips support 9 per row in the case, as indicated by the pallet configuration of 72 per pallet (9 x 8).\n",
      " CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY\n"
     ]
    }
   ],
   "source": [
    "# Process the response based on trustworthiness threshold\n",
    "threshold = 0.8  # set this after inspecting responses around different trustworthiness ranges\n",
    "if output['trustworthiness_score'] < threshold:\n",
    "    output['response'] = output['response'] + \"\\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY\"\n",
    "\n",
    "# Example using our previous output\n",
    "print(\"Trustworthiness score:\", output['trustworthiness_score'])\n",
    "print(\"\\nProcessed response with warning:\", output['response'])"
   ]
  },
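  {
   "cell_type": "markdown",
   "id": "e5b08f32",
   "metadata": {},
   "source": [
    "The warning above is one of several contingency plans mentioned earlier. As a sketch of how they could be combined, the function below maps a TLM-style output dict to an action using two cutoffs. The function name `route_response`, the action labels, and the `0.5` lower cutoff are hypothetical choices for illustration; only the `response` and `trustworthiness_score` keys and the `0.8` threshold come from this notebook.\n",
    "\n",
    "```python\n",
    "def route_response(output, warn_below=0.8, escalate_below=0.5):\n",
    "    # Decide how to handle a response given its trustworthiness score\n",
    "    score = output['trustworthiness_score']\n",
    "    if score < escalate_below:\n",
    "        # Very low trust: withhold the answer and hand off to a person\n",
    "        return {'action': 'human_review', 'response': None}\n",
    "    if score < warn_below:\n",
    "        # Moderate trust: show the answer with a disclaimer appended\n",
    "        warned = output['response'] + ' (Caution: this answer may be unreliable.)'\n",
    "        return {'action': 'warn', 'response': warned}\n",
    "    return {'action': 'answer', 'response': output['response']}\n",
    "\n",
    "\n",
    "example = {'response': 'Yes, 9 per row.', 'trustworthiness_score': 0.75}\n",
    "print(route_response(example)['action'])\n",
    "```\n",
    "\n",
    "Returning a new dict instead of editing `output` in place keeps the original TLM response available for logging or later review."
   ]
  },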
  {
   "cell_type": "markdown",
   "id": "929c75a5-e2f3-4116-a905-0147f92c86c5",
   "metadata": {},
   "source": [
    "### Apply Cleanlab's TLM to Existing RAG Prompt/Response Pairs\n",
    "\n",
    "We can also use the [get_trustworthiness_score](https://help.cleanlab.ai/reference/python/trustworthy_language_model/#method-get_trustworthiness_score) method from Cleanlab's TLM Python API to score prompt/response pairs that you already have. This is useful for a RAG system in which you have already retrieved the context and obtained a response from your own LLM but still want a measure of how trustworthy each response is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "d825b76f-8c07-4d17-8a45-732cc572c1e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TLM Score for Response A: {'trustworthiness_score': 0.8271985087770306}\n",
      "TLM Score for Response B: {'trustworthiness_score': 0.0635722930167351}\n"
     ]
    }
   ],
   "source": [
    "# Redefine TLM to use the best quality preset and GPT-4o model\n",
    "tlm = studio.TLM(quality_preset=\"best\", options={\"model\": \"gpt-4o\"})\n",
    "\n",
    "prompt = \"\"\"Based on the following documents, answer the given question. \n",
    "\n",
    "Documents:  We're getting new desks! Specs are here: Staples Model RTG120XLDBL BasePage \\\n",
    "CollectionModel | Dimensions width = 60.0in height = 48.0in depth = 24.0in Base Color Black \\\n",
    "Top Color White | Specs SheetsPowered by TCPDF (www.tcpdf.org)\n",
    "\n",
    "Question: What is the width of the new desks?\n",
    "\"\"\"\n",
    "\n",
    "response_A = \"60 inches\"\n",
    "response_B = \"24 inches\"\n",
    "\n",
    "trust_score_A = tlm.get_trustworthiness_score(prompt, response_A)\n",
    "trust_score_B = tlm.get_trustworthiness_score(prompt, response_B)\n",
    "\n",
    "print(f\"TLM Score for Response A: {trust_score_A}\")\n",
    "print(f\"TLM Score for Response B: {trust_score_B}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4cc73e6",
   "metadata": {},
   "source": [
    "As expected, the prompt/response pair with the correct response receives a much higher trustworthiness score than the pair with the incorrect response."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6438b38",
   "metadata": {},
   "source": [
    "To summarize, we used Cleanlab and Pinecone to build a reliable, curated, and accurate RAG system.\n",
    "\n",
    "We used Pinecone to create the vector database, store our document chunks and their metadata, and serve as our similarity search engine, all of which are central to the storage and retrieval components of RAG.\n",
    "\n",
    "We used Cleanlab's Trustworthy Language Model (TLM) to tag our document chunks with a topic and then to remove chunks that were confidently flagged as badly chunked or as containing PII, leaving a more accurate and reliable document set in the vector DB. As a result, retrieval draws only from the curated documents. Lastly, we used TLM to classify the topic of a user question for metadata filtering and to score RAG outputs for trustworthiness so that likely hallucinations and nonsensical answers can be flagged.\n",
    "\n",
    "For more guidance on how you can leverage TLM in RAG applications, you can refer to this Cleanlab Python API [tutorial](https://help.cleanlab.ai/tutorials/tlm_rag/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "mturk-env",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
